Parse Module

Overview

Why is parsing needed?
The data capture system does not have fields for each distinct piece of information with a distinct use, leading to user workarounds, such as entering many distinct pieces of information into a single free text field, or using the wrong fields for information which has no obvious place (for example, placing company information in individual contact fields).

The data needs to be moved to a new system, with a different data structure.

Duplicates need to be removed from the data, and it is difficult to identify and remove duplicates due to the data structure (for example, key address identifiers such as the Premise Number are not separated from the remainder of the address).

Alternatively, the structure of the data may be sound, but the use of it insufficiently controlled, or subject to error.

For example:
  1. Users are not trained to gather all the required information, causing issues such as entering contacts with ‘data cheats’ rather than real names in the name fields.
  2. The application displays fields in an illogical order, leading to users entering data in the wrong fields.
  3. Users enter duplicate records in ways that are hard to detect, such as entering inaccurate data in multiple records representing the same entity, or entering the accurate data, but in the wrong fields.

These issues, which are not an exhaustive list, all lead to poor data quality, which in many cases is costly to the business. It is therefore important for businesses to be able to analyze data for these problems, and to resolve them where necessary. [Source]
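
One of the structural problems described above, a Premise Number that is not separated from the remainder of the address, can be illustrated with a short sketch. The pattern and the function names (split_premise, dedup_key) are assumptions made for this example only; real addresses vary far more than this pattern allows.

```python
import re

# Assumed pattern: a leading run of digits, optionally followed by one
# letter (for example "12B"), then whitespace, then the rest of the address.
PREMISE_RE = re.compile(r"^\s*(\d+[A-Za-z]?)\s+(.+?)\s*$")


def split_premise(address):
    """Return (premise_number, remainder); premise_number is None if absent."""
    match = PREMISE_RE.match(address)
    if match is None:
        return None, address.strip()
    return match.group(1), match.group(2)


def dedup_key(address):
    """Build a normalized key so that near-identical addresses compare equal."""
    premise, rest = split_premise(address)
    # Collapse repeated whitespace and ignore letter case in both parts.
    normalized_rest = " ".join(rest.lower().split())
    return (premise.upper() if premise else None, normalized_rest)


print(split_premise("12B High Street"))
print(dedup_key("  12b   high  street ") == dedup_key("12B High Street"))
```

Once the premise number is held separately, two records that differ only in spacing or letter case produce the same key, which makes likely duplicates easier to spot.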

Key Concepts

Data Types
Basic Primitive Types
  1. Character (char)
  2. String
  3. Integer (integer, int, short, long, byte) with a variety of precisions
  4. Floating-point number (float, double, real, double precision)
  5. Boolean
  6. Alphanumeric

Basic Data Types        Example
Integer                 An integer number, from -2147483648 to 2147483647.
Double, Real or Float   A floating-point value, for instance, 3.14.
                        Range of available values:
                        -1.79769313486232×10^308 … -4.94065645841247×10^-324 for negative values;
                        4.94065645841247×10^-324 … 1.79769313486232×10^308 for positive values.
String                  Any textual data.
Character               A single character, such as a letter or other symbol.
Boolean                 A value that is either True or False.
Date/Time               A value that stores a date, a time, or both. This special type makes operations on date and time values easier to perform.
Alphanumeric            A string consisting of numbers, letters and/or special characters.
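
The types listed above can be illustrated with a minimal sketch that guesses the basic type of a raw text value. The function name infer_type, the order of the checks, and the date format (YYYY-MM-DD) are assumptions for this example; a real parsing specification would define its own rules. Alphanumeric values are not distinguished from ordinary strings here.

```python
from datetime import datetime


def infer_type(text):
    """Guess a basic data type for one raw text value (assumed rules)."""
    value = text.strip()
    if value.lower() in ("true", "false"):
        return "Boolean"
    try:
        int(value)
        return "Integer"
    except ValueError:
        pass
    try:
        float(value)
        return "Float"
    except ValueError:
        pass
    try:
        # Assumption: dates are written as YYYY-MM-DD.
        datetime.strptime(value, "%Y-%m-%d")
        return "Date/Time"
    except ValueError:
        pass
    if len(value) == 1:
        return "Character"
    return "String"


for raw in ["42", "3.14", "True", "2024-01-31", "x", "Boston"]:
    print(raw, "is", infer_type(raw))
```

The order of the checks matters: the integer test runs before the floating-point test because every integer literal would also be accepted as a floating-point number.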


Additionally:
  1. List each field and its data type
  2. If you have structured data (an Excel spreadsheet, for example), list each column heading and the type of data it holds
  3. User defined data types are also allowed
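
The listing of each column heading and its data type can be sketched as follows for data exported as CSV text. The function names (classify, column_types), the simplified three-type rule set, and the sample values are all made up for illustration and are not part of the module.

```python
import csv
import io


def classify(value):
    """Classify one cell as Integer, Float, or String (simplified rules)."""
    for label, caster in (("Integer", int), ("Float", float)):
        try:
            caster(value)
            return label
        except (ValueError, TypeError):
            continue
    return "String"


def column_types(csv_text):
    """Return a dict mapping each column heading to the type of data it holds."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            seen[name].add(classify(row[name]))
    result = {}
    for name, labels in seen.items():
        if labels == {"Integer"}:
            result[name] = "Integer"
        elif labels and labels <= {"Integer", "Float"}:
            # A mix of whole and decimal numbers is reported as Float.
            result[name] = "Float"
        else:
            result[name] = "String"
    return result


# Made-up sample data, used only to exercise the sketch.
SAMPLE = "city,visits,latitude\nSpringfield,1000,40.5\nRiverton,250,38.25\n"

for heading, data_type in column_types(SAMPLE).items():
    print(heading, ":", data_type)
```

A column is reported as String whenever any of its cells fails both numeric checks, so a single stray text value is enough to flag a column for closer inspection.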

Reference: Visualizing Data by Ben Fry (Chapter 1, pp. 8-9).


Assumptions:
What aspect of the process are you treating as truth?

Resources

Basic Data Types   Description
Integer            An integer number with no decimal point, for example, any value from -2147483648 to 2147483647.
Float              A number with a decimal point (used, for example, with latitudes and longitudes for determining location).
                   An example of a floating-point value is 3.14. Some programming languages call this type Real; others call it Double.
Character          A single letter or other symbol.
String             Any textual data or set of characters that forms a word or a sentence.
                   Examples might include: city, state, address.
Alphanumeric       Consisting of both letters and numbers and often other symbols (such as punctuation marks and mathematical symbols).
                   An example of an alphanumeric string: a strong password.
Boolean            A value that is either True or False.


Use the following external links for additional resources.




Practice Quiz

Instructions
Choose an answer and hit 'Next Question'. You will receive your score and answers at the end of the quiz.
Download Quiz
Click on "Download" to save a copy of the practice quiz.

Worksheet

Complete the worksheet to practice the module and revise what you have learnt.



Self Assessment

Complete this assessment to demonstrate your current knowledge of the Parse stage:

Prerequisites: Make sure to finish the following tasks before working on this assessment.



Review

Horizontal assessment of the parse stage across the data visualization process, mapped to Bloom’s Taxonomy of Hierarchical Learning

What you should know:

Bloom’s Taxonomy Hierarchy   You should know
Remember                     What parsing means.
Understand                   The basic data types. See the Resources section.
Apply                        How to break data into its most basic parts.
Evaluate                     Categorize the data according to the parsing specs.
Analyze                      Identify specific features about the data.
Create                       Plan, generate, and produce a parsed listing of the data.


What you should be able to do:

Bloom’s Taxonomy Hierarchy   You should be able to do
Remember                     Describe what happens in the parse stage.
Understand                   Describe the data in detail according to the parsing specifications.
Apply                        Demonstrate the ability to change data into a useful format for processing.
Evaluate                     Categorize the data according to parsing specs.
Analyze                      Identify specific features about the data.
Create                       Generate a parsed listing of the data.


You are here:
  • Acquire
  • Parse
  • Mine
  • Sketching & Ideation
  • Filter
  • Represent
  • Critique
  • Refine
  • Interact